[Data] Normalize block types before internal multi-block operations #43764

Merged

c21 merged 4 commits into ray-project:master from scottjlee:0306-blocktypes

Mar 8, 2024

Contributor

scottjlee commented Mar 6, 2024

Why are these changes needed?

Applying grouping operations to Datasets whose underlying blocks have different types can raise exceptions (e.g. various AttributeErrors), because BlockAccessors assume that all input blocks are of the same type.

We handle this case by normalizing the blocks (either ArrowBlock or PandasBlock) to the type of the first block before applying the rest of the grouping/aggregation logic.
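
As a rough illustration of the normalization step described above (a sketch only, not the code in this PR): the toy `ColumnBlock` and `RowBlock` classes below are hypothetical stand-ins for ArrowBlock and PandasBlock, and `normalize_to_first_type` is an assumed helper name that does not come from Ray.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ColumnBlock:
    """Hypothetical stand-in for an Arrow-style block: column name -> values."""
    columns: Dict[str, list]


@dataclass
class RowBlock:
    """Hypothetical stand-in for a pandas-style block: a list of row dicts."""
    rows: List[dict]


Block = Union[ColumnBlock, RowBlock]


def to_column_block(block: Block) -> ColumnBlock:
    """Return the block in columnar form, converting row data if needed."""
    if isinstance(block, ColumnBlock):
        return block
    names = list(block.rows[0]) if block.rows else []
    return ColumnBlock({n: [row[n] for row in block.rows] for n in names})


def to_row_block(block: Block) -> RowBlock:
    """Return the block in row form, converting columnar data if needed."""
    if isinstance(block, RowBlock):
        return block
    names = list(block.columns)
    length = len(block.columns[names[0]]) if names else 0
    return RowBlock(
        [{n: block.columns[n][i] for n in names} for i in range(length)]
    )


def normalize_to_first_type(blocks: List[Block]) -> List[Block]:
    """Convert every block to the type of the first block (assumed policy)."""
    if not blocks:
        return []
    if isinstance(blocks[0], ColumnBlock):
        convert = to_column_block
    else:
        convert = to_row_block
    return [convert(block) for block in blocks]


# Mixed input: after normalization every block shares the first block's type,
# so downstream grouping logic can assume a single block type.
mixed = [ColumnBlock({"x": [1, 2]}), RowBlock([{"x": 3}])]
normalized = normalize_to_first_type(mixed)
```

Once all blocks share one type, the accessor for that type can be used for the whole group without per-block type checks.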

Related issue number

Closes #31550
Closes #39206
Closes #39155
Closes #39291

Inspired by #39960

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

scottjlee added 2 commits

March 6, 2024 14:44


          handle multiple block types in zip

Signed-off-by: Scott Lee <[email protected]>


          handle different block types in groupby/agg/sort

9db0d6a

Signed-off-by: Scott Lee <[email protected]>

scottjlee marked this pull request as ready for review

March 7, 2024 00:29

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners

March 7, 2024 00:29

scottjlee assigned c21 and bveeramani

c21 reviewed


python/ray/data/_internal/table_block.py Outdated

Comment on lines 225 to 227

+                              # If block types are different, but still both of TableBlock type, try
+                              # converting both to default block type before zipping.
+                              self_default, other_default = self.to_default(), acc.to_default()

Contributor

c21 Mar 7, 2024

Can we call normalize_block_types here instead, to keep the code consistent?

python/ray/data/_internal/table_block.py

+                              self_default, other_default = self.to_default(), acc.to_default()
+                              return BlockAccessor.for_block(self_default).zip(other_default)
+                          else:
+                              raise ValueError(

Contributor

c21 Mar 7, 2024

In which case will this ValueError be triggered?

Contributor Author

scottjlee Mar 7, 2024

I think this will be triggered for Blocks that do not extend the TableBlock class. Since both ArrowBlock and PandasBlock are TableBlocks themselves, this isn't an issue for these classes, but it would cover any case in which we have other types of Blocks in the future. I can also remove this if we think it's not useful.

Contributor

c21 Mar 7, 2024

Got it, it's fine to keep it.
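
The fallback discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `ToyTable`, `ToyDefaultTable`, `ToyOtherTable`, and `zip_blocks` are hypothetical names, and the toy zip ignores details such as duplicate column names and row-count checks.

```python
from typing import Dict


class ToyTable:
    """Hypothetical table-like base class; stands in for a TableBlock."""

    def __init__(self, columns: Dict[str, list]):
        self.columns = columns

    def to_default(self) -> "ToyDefaultTable":
        # Convert to the (assumed) default table type.
        return ToyDefaultTable(dict(self.columns))


class ToyDefaultTable(ToyTable):
    """Hypothetical default block type."""


class ToyOtherTable(ToyTable):
    """Hypothetical second table type."""


def zip_blocks(left, right):
    """Zip two blocks column-wise, converting mixed table types first."""
    both_tables = isinstance(left, ToyTable) and isinstance(right, ToyTable)
    if both_tables and type(left) is type(right):
        # Same table type: merge columns directly.
        return type(left)({**left.columns, **right.columns})
    if both_tables:
        # Different table types: convert both to the default type, then zip.
        return zip_blocks(left.to_default(), right.to_default())
    # At least one input is not table-like, so there is no common type to
    # convert to. This mirrors the ValueError branch discussed above.
    raise ValueError(
        "Cannot zip blocks of types "
        f"{type(left).__name__} and {type(right).__name__}: "
        "both must be table-like blocks."
    )


zipped = zip_blocks(ToyOtherTable({"spam": [0]}), ToyDefaultTable({"ham": [1]}))
```

With only table-like block types in play, the final branch is unreachable, which matches the observation that it only guards against block types added in the future.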

python/ray/data/_internal/table_block.py Outdated

+                      seen_types = set()
+                      for block in blocks:
+                          acc = BlockAccessor.for_block(block)
+                          assert isinstance(acc, TableBlockAccessor), type(acc)

Contributor

c21 Mar 7, 2024

Better to raise an error with an actionable message instead of using assert.

Member

bveeramani Mar 7, 2024

@c21 do we still use non-table blocks anywhere?

Contributor

c21 Mar 7, 2024

do we still use non-table blocks anywhere?

Actually I don't think we use non-table blocks anywhere.
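
The suggestion in this thread, raising an actionable error rather than asserting, could look roughly like the sketch below. `ToyTableAccessor` and `require_table_accessor` are hypothetical names, and the message wording is an assumption rather than what was merged. One general reason to prefer an explicit raise is that `assert` statements are skipped when Python runs with the `-O` flag.

```python
class ToyTableAccessor:
    """Hypothetical stand-in for a table-block accessor class."""


def require_table_accessor(acc: object) -> ToyTableAccessor:
    """Return acc unchanged if it is a table accessor, else raise TypeError."""
    if not isinstance(acc, ToyTableAccessor):
        # Name the offending type and say what the caller can do about it,
        # instead of failing with a bare AssertionError.
        raise TypeError(
            "Block type normalization only supports table-like blocks, "
            f"but got an accessor of type {type(acc).__name__!r}. "
            "Convert the data to a table-like block first."
        )
    return acc


accessor = require_table_accessor(ToyTableAccessor())
```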

python/ray/data/_internal/table_block.py Outdated

+                      else:
+                          results = [BlockAccessor.for_block(block).to_default() for block in blocks]
+                      assert all(isinstance(block, type(results[0])) for block in results)

Contributor

c21 Mar 7, 2024

same here.

bveeramani approved these changes


python/ray/data/tests/test_all_to_all.py Outdated

Comment on lines 124 to 128

+              def test_zip_multiple_block_types(ray_start_regular_shared):
+                  df = pd.DataFrame({"spam": [0]})
+                  ds_pd = ray.data.from_pandas(df)
+                  ds2_arrow = ray.data.from_items([{"ham": [0]}])
+                  assert ds_pd.zip(ds2_arrow).take_all() == [{"spam": 0, "ham": [0]}]

Member

bveeramani Mar 7, 2024

Should we move this to a different test module? I don't think zip is an all-to-all operation.

c21 added the release-blocker label

scottjlee added 2 commits

March 7, 2024 15:51


          address comments

eb6295b

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into 0306-blocktypes

41dd346

Signed-off-by: Scott Lee <[email protected]>

c21 approved these changes


Contributor

c21 left a comment

LG

c21 merged commit 3e4f21f into ray-project:master

9 checks passed

wpm mentioned this pull request

[Data] Error when Aggregation tries to normalize Pandas blocks to Arrow #45599

Open


Reviewers

bveeramani approved these changes

c21 approved these changes

ericl: awaiting requested review

scv119: awaiting requested review

amogkam: awaiting requested review

raulchen: awaiting requested review (code owner)

stephanie-wang: awaiting requested review (code owner)

omatthew98: awaiting requested review (code owner)

Labels

release-blocker